30 research outputs found

    Evaluation of Text Document Clustering Using K-Means

    Language and written text are the foundation of human communication. Social media is an essential source of data on the Internet, and email and text messages are also major sources of textual data. Text data is processed and analyzed using text mining methods. Text mining extends data mining to text files in order to extract relevant information from large amounts of text and to recognize patterns. Cluster analysis is one of the most important text mining methods. Its goal is to automatically partition a set of objects into a finite number of homogeneous groups (clusters). Objects within a group should be as similar as possible, whereas objects from different groups should have different characteristics. Cluster analysis starts with a precise definition of the task and the selection of representative data objects. A challenge with text documents is their unstructured form, which requires extensive pre-processing. Natural Language Processing (NLP) is used for the automated processing of natural language. Text files can be converted into numerical form using the Bag-of-Words (BoW) approach or neural networks. Each data object can then be represented as a point in a finite-dimensional space whose dimension corresponds to the number of unique tokens, here words. Before the actual cluster analysis, a measure must be defined to determine the similarity or dissimilarity between objects; dissimilarity can be measured with metrics such as the Euclidean distance. Clustering methods are then applied. These methods fall into different categories. On the one hand, there are hierarchical cluster methods, which form a hierarchical system of groups. On the other hand, there are partitioning methods, which divide the objects into a predetermined number of groups by optimizing a homogeneity measure. An important representative is the k-Means method, which is used in this thesis. Finally, the results are evaluated and interpreted. This thesis introduces the different methods used in the individual steps of cluster analysis. To determine which method appears most suitable for clustering documents, a practical investigation was carried out on three different data sets.
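
The pipeline described in this abstract (Bag-of-Words vectors, Euclidean distance, k-Means partitioning) can be illustrated with a minimal sketch. It is not the thesis's implementation: the whitespace tokenizer, the function names `bag_of_words`, `euclidean`, and `k_means`, the fixed starting centroids, and the four toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Map each document to a count vector over the shared vocabulary.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One dimension per unique token in the corpus.
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, k, init_indices, max_iter=100):
    """Lloyd-style k-Means with explicitly chosen starting centroids.

    Assumption: init_indices (hypothetical parameter) names k documents
    to use as initial centroids, which keeps the result deterministic.
    """
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy corpus (invented for this sketch): two topics, two documents each.
docs = [
    "cats purr and cats sleep",
    "cats sleep all day",
    "stocks rise and markets rally",
    "markets rally as stocks rise",
]
vocab, vectors = bag_of_words(docs)
labels, centroids = k_means(vectors, k=2, init_indices=[0, 2])
print(labels)
```

The number of clusters `k` is fixed in advance, as the abstract notes for partitioning methods. In practice the starting centroids are usually chosen randomly or by a seeding heuristic, so results can vary between runs.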

    Data Science Meets Nuclear - What Data Analytics, Computational Intelligence and Machine Learning Can Contribute to Nuclear Waste Management and Nuclear Verification

    Data science is a multidisciplinary field that studies all aspects of data, from its generation through processing to its conversion into a valuable source of knowledge. While data science has a wide range of applications, to what extent have new data science methods made their way into research related to nuclear waste management and nuclear verification? And which further research questions in these fields would particularly benefit from the use of new data science methods? This paper therefore has two objectives: first, to highlight the state of the art of data science in nuclear waste management and nuclear verification; second, to discuss the potential use of data science in these fields. Ideas for data science in nuclear waste management include, e.g., i) facilitating integration, analytics and visualization of data in the comparative selection process for a geological repository site, ii) creating a virtual geological repository system, and iii) monitoring a geological repository over its life cycle phases. In nuclear verification, data science can make a significant contribution to i) unattended monitoring using, e.g., seals/tags, surveillance (optical, 2D/3D laser, gamma, etc.), radiation measurements, etc.; ii) perimeter monitoring through surveillance (optical, gamma, thermal, etc.), radiation measurements, etc.; and iii) wide area monitoring using satellite imagery, geophysical monitoring, environmental sampling, etc.